Computer Vision is Deep Learning
Images are Numbers
What the computer sees
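To make "images are numbers" concrete, here is a minimal sketch in plain Python. The 3x3 grayscale values are made up for illustration, and the helper names `flatten` and `normalize` are hypothetical, not from the lecture.

```python
# A tiny 3x3 grayscale "image": each entry is a pixel intensity in [0, 255].
# The values are invented for illustration.
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]

def flatten(img):
    """Turn a 2D grid of pixels into one flat list of numbers."""
    return [pixel for row in img for pixel in row]

def normalize(pixels):
    """Scale intensities from [0, 255] to [0.0, 1.0]."""
    return [p / 255 for p in pixels]

# What a classifier receives: a plain vector of numbers.
vector = normalize(flatten(image))
```

A color image works the same way, with three such grids (R, G, B) instead of one.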
* Regression: The output variable takes continuous values

* Classification: The output variable takes class labels
  * Underneath it may still produce continuous values such as
    probability of belonging to a particular class.
Computer Vision with Deep Learning:

Our intuition about what's “hard” is flawed (in complicated ways)

Visual perception: 540,000,000 years of data
Bipedal movement: 230,000,000 years of data
Abstract thought: ~100,000 years of data

[Figure: an image with "Prediction: Dog", plus a distortion, gives "Prediction: Ostrich"]

“Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of
the world and how to survive in it.... Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet
mastered it. It is not all that intrinsically difficult; it just seems so when we do it.”

- Hans Moravec, Mind Children (1988)
Neuron: Biological Inspiration for Computation
* Neuron: computational building
block for the brain

[Figure: biological neuron (dendrite, cell body, output axon, synapse) beside its artificial model. An input x_0 arrives on the axon from a neuron and is scaled by the synapse weight w_0; the cell body computes \sum_i w_i x_i + b; the output axon carries f(\sum_i w_i x_i + b), where f is the activation function.]

* (Artificial) Neuron: computational
building block for the "neural network"
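The artificial neuron can be sketched in a few lines: a weighted sum of inputs plus a bias, passed through an activation function. This is a minimal illustration, assuming a sigmoid activation; the input and weight values are made up.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Swapping `activation` for another function (for example `lambda z: max(0.0, z)`) changes the neuron's nonlinearity without touching the weighted sum.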
Differences (among others):

* Parameters: Human brains have ~10,000,000
times more synapses than artificial neural networks.

* Topology: Human brains have no “layers”.
Topology is complicated.

* Async: The human brain works
asynchronously, ANNs work synchronously.

* Learning algorithm: ANNs use gradient
descent for learning. Human brains use ... (we
don’t know)

* Processing speed: Single biological neurons
are slow, while standard neurons in ANNs are
fast.

* Power consumption: Biological neural
networks use very little power compared to
artificial networks

* Stages: Biological networks usually don't stop
/ start learning. ANNs have different fitting
(train) and prediction (evaluate) phases.


Similarity (among others):
* Distributed computation on a large scale.
Human Vision

Its structure is instructive and inspiring!
[Figure labels: the reticular formation, reticular formation impulses, sensory tracts]
Visual Cortex

(Its Structure is Instructive and Inspiring)
Deep Learning is Hard:

Illumination Variability
Deep Learning is Hard:

Pose Variability

Parkhi et al. "The truth about cats and dogs." 2011.
Deep Learning is Hard:

Intra-Class Variability
Philosophical Ambiguity:
“Image Classification” is not (yet) “Understanding”
Image Classification Pipeline
Famous Computer Vision Datasets
Let's Build an Image Classifier for CIFAR-10

[Figure: test image, training image, and their pixel-wise absolute value differences]
Accuracy

Random: 10%

Our image-diff (with L1): 38.6%
Our image-diff (with L2): 35.4%
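The image-diff classifier can be sketched as a nearest-neighbor lookup: compare the test image with every training image and copy the label of the closest one. This is a minimal sketch in plain Python; the 4-pixel "images" and their labels are made up, and real CIFAR-10 images would be flat lists of 32x32x3 = 3072 numbers.

```python
def l1_distance(a, b):
    """Sum of pixel-wise absolute differences between two flat images."""
    return sum(abs(p - q) for p, q in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance between two flat images."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def predict_nearest(test_image, train_images, train_labels, distance=l1_distance):
    """Return the label of the training image closest to test_image."""
    best = min(range(len(train_images)),
               key=lambda i: distance(test_image, train_images[i]))
    return train_labels[best]

# Toy 4-pixel "images" with made-up labels, for illustration only.
train_images = [[0, 0, 0, 0], [255, 255, 255, 255], [0, 255, 0, 255]]
train_labels = ["dark", "bright", "striped"]
label = predict_nearest([10, 240, 5, 250], train_images, train_labels)
```

Passing `distance=l2_distance` switches between the two variants whose accuracies are compared above.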
K-Nearest Neighbors: Generalizing the Image-Diff Classifier

[Figure: the data, NN classifier, 5-NN classifier]

[Figure: training data split into fold 1, fold 2, fold 3, fold 4, fold 5, with the test data held out]

Accuracy

Random: 10%

Training and testing on the same data: 35.4%
7-Nearest Neighbors: ~30%

H: ~95%

Convolutional Neural Networks: ~97.75%
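K-nearest neighbors generalizes the image-diff classifier by letting the k closest training images vote (the image-diff classifier is the k = 1 case). A minimal sketch, assuming L1 distance and a simple majority vote; the 2-pixel "images" and labels are made up.

```python
from collections import Counter

def l1_distance(a, b):
    """Sum of pixel-wise absolute differences between two flat images."""
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_predict(test_image, train_images, train_labels, k=5):
    """Majority vote among the k training images nearest to test_image."""
    order = sorted(range(len(train_images)),
                   key=lambda i: l1_distance(test_image, train_images[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-pixel "images" with made-up labels.
train_images = [[0, 0], [1, 1], [2, 2], [9, 9], [10, 10]]
train_labels = ["a", "a", "b", "b", "b"]
```

On this toy data the prediction for `[1, 0]` is "a" with k = 1 or k = 3 but flips to "b" with k = 5, which is why k is treated as a hyperparameter. The fold diagram suggests choosing it by cross-validation on the training data rather than on the test data.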
Reminder: Weighing the Evidence
Reminder: “Learning” is Optimization of a Function

[Figure: Supervised Learning (correct label is provided). Forward pass: an image goes through a block of differentiable compute W (e.g. neural net) to produce log probabilities. Backward pass: gradients flow back from the correct action, label = 0.]

Ground truth for “6”:
y(x) = (0,0,0,0,0,0,1,0,0,0)^T


"Loss" function:


C(w,b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2
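A small worked example of the ground-truth vector and the loss. It assumes the loss on the slide is the quadratic cost, C = 1/(2n) times the sum over examples of ||y(x) - a||^2, where a is the network output. The function names are hypothetical.

```python
def one_hot(label, num_classes=10):
    """Ground-truth vector: 1.0 at the label's index, 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def quadratic_loss(targets, outputs):
    """C = 1/(2n) * sum over examples of ||y(x) - a||^2 (assumed form)."""
    n = len(targets)
    total = 0.0
    for y, a in zip(targets, outputs):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

y = one_hot(6)                              # ground truth for "6"
perfect = quadratic_loss([y], [y])          # output equals the target
wrong = quadratic_loss([y], [one_hot(3)])   # confident but wrong output
```

The loss is zero only when the output matches the target exactly, so "learning" means adjusting the weights and biases to push this number down.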
References: [80]

[Figure: network with a hidden layer (n = 15 neurons)]
Convolutional Neural Networks

Regular neural network (fully connected):
Convolutional neural network:
Each layer takes a 3d volume, produces 3d volume with some
smooth function that may or may not have parameters.
Convolutional Neural Networks: Layers

* INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and
with three color channels R,G,B.

* CONV layer will compute the output of neurons that are connected to local regions in the input, each computing
a dot product between their weights and a small region they are connected to in the input volume. This may
result in volume such as [32x32x12] if we decided to use 12 filters.

* RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves
the size of the volume unchanged ([32x32x12]).

* POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in
volume such as [16x16x12].

* FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of
the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary
Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the
previous volume.

[Figure: example ConvNet built from repeated CONV layers]

Layers highlighted in blue
have learnable parameters.
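The volume sizes in the layer list can be traced with a few shape helpers. This is a sketch under assumptions the slide does not state: the CONV layer uses 3x3 filters with stride 1 and zero-padding 1 (which keeps 32x32), and the POOL layer uses 2x2 windows with stride 2.

```python
def conv_shape(shape, num_filters, filter_size=3, stride=1, pad=1):
    """Output (width, height, depth) of a CONV layer (assumed hyperparameters)."""
    w, h, _ = shape
    out_w = (w - filter_size + 2 * pad) // stride + 1
    out_h = (h - filter_size + 2 * pad) // stride + 1
    return (out_w, out_h, num_filters)

def relu_shape(shape):
    """Elementwise max(0, x) leaves the volume size unchanged."""
    return shape

def pool_shape(shape, size=2, stride=2):
    """Downsample width and height; depth is untouched."""
    w, h, d = shape
    return ((w - size) // stride + 1, (h - size) // stride + 1, d)

def fc_shape(shape, num_classes):
    """Fully connected layer: one score per class."""
    return (1, 1, num_classes)

shapes = [(32, 32, 3)]                                # INPUT
shapes.append(conv_shape(shapes[-1], num_filters=12)) # CONV
shapes.append(relu_shape(shapes[-1]))                 # RELU
shapes.append(pool_shape(shapes[-1]))                 # POOL
shapes.append(fc_shape(shapes[-1], num_classes=10))   # FC
```

With these assumptions `shapes` reproduces the sizes listed above, from [32x32x3] to [1x1x10].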
Dealing with Images: Local Connectivity
Same neuron. Just more focused (narrow “receptive field”).


The parameters on each filter are spatially “shared”
(if a feature is useful in one place, it’s useful elsewhere)
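A rough count shows what local connectivity and parameter sharing buy. The sketch compares a fully connected layer with a convolutional layer that both map a 32x32x3 input to a 32x32x12 output. The 3x3 filter size is an assumption for illustration.

```python
def fc_params(in_shape, out_shape):
    """Fully connected: every output neuron has a weight per input, plus a bias."""
    n_in = in_shape[0] * in_shape[1] * in_shape[2]
    n_out = out_shape[0] * out_shape[1] * out_shape[2]
    return n_in * n_out + n_out

def conv_params(in_depth, num_filters, filter_size=3):
    """Convolutional: one small filter (plus bias) per output channel,
    reused at every spatial position."""
    return num_filters * (filter_size * filter_size * in_depth + 1)

fc = fc_params((32, 32, 3), (32, 32, 12))        # 37,761,024 parameters
conv = conv_params(in_depth=3, num_filters=12)   # 336 parameters
```

The convolutional count does not depend on the image size at all, only on the filter size and the number of filters.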
ConvNets: Spatial Arrangement of Output Volume

* Depth: number of filters
* Stride: filter step size (when we “slide” it)
* Padding: zero-pad the input
[Figure: Input Volume (+pad 1) (7x7x3), Filter W0 (3x3x3), Filter W1 (3x3x3), Output Volume (3x3x2)]
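Stride and padding fix the spatial size of the output through the standard relation (W - F + 2P) / S + 1, where W is the input size, F the filter size, P the padding, and S the stride. A small sketch follows; reading the figure as a 5x5 input padded to 7x7 and convolved with stride 2 is an inference from its 3x3 output, not something the slide states.

```python
def conv_output_size(input_size, filter_size, stride, pad):
    """Spatial output size: (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

# Figure's example (assumed): 5x5 input, pad 1, 3x3 filters, stride 2.
size = conv_output_size(input_size=5, filter_size=3, stride=2, pad=1)
```

The output depth is simply the number of filters (2 in the figure), so the output volume is 3x3x2.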
Convolution
-1 -1 -1
-1  8 -1
-1 -1 -1
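The kernel above can be applied with a few lines of plain Python. This sketch slides the kernel over the image with no padding and stride 1. As in most deep learning libraries, the kernel is not flipped, which makes no difference here because the kernel is symmetric. The test images are made up.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the
    elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

EDGE_KERNEL = [[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]]

flat = [[5] * 4 for _ in range(4)]            # uniform region
spot = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]      # one bright pixel

flat_response = convolve2d(flat, EDGE_KERNEL)
spot_response = convolve2d(spot, EDGE_KERNEL)
```

Because the kernel's entries sum to zero, uniform regions give no response while a pixel that differs from its neighbors gives a strong one, which is what makes this kernel an edge detector.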
Convolution: Representation Learning
ConvNets: Pooling

[Figure: single depth slice, max pool with 2x2 filters and stride 2]

[Figure: pooling downsamples a 224x224x64 volume to 112x112x64]
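Max pooling with 2x2 filters and stride 2 keeps the largest value in each non-overlapping 2x2 window of a depth slice. A minimal sketch; the 4x4 values are illustrative.

```python
def max_pool_2x2(slice_2d):
    """Max pool one depth slice with 2x2 filters and stride 2."""
    rows, cols = len(slice_2d), len(slice_2d[0])
    return [[max(slice_2d[r][c], slice_2d[r][c + 1],
                 slice_2d[r + 1][c], slice_2d[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

depth_slice = [[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]
pooled = max_pool_2x2(depth_slice)   # 4x4 slice becomes 2x2
```

Pooling is applied to each depth slice independently, so width and height halve while the depth stays the same.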
Same Architecture, Many Applications
This part might look different for:

* Different image classification domains

* Image captioning with recurrent neural networks

* Image object localization with bounding box

* Image segmentation with fully convolutional networks

* Image segmentation with deconvolution layers
Object Recognition
Case Study: ImageNet

[Figure: convolutional network. Feature extraction: input 32x32, feature maps 28x28, feature maps 14x14 (subsampling), feature maps 10x10, feature maps 5x5 (subsampling). Classification: fully connected output.]

[Figure: example ImageNet labels: amphibian, fireboat, bumper car, snow leopard, drilling platform, golfcart, Egyptian cat]
What is ImageNet?

* ImageNet: dataset of 14+ million images (21,841 categories)

* Let's take the high level category of fruit as an example:
  * Total 188,000 images of fruit
  * There are 1206 Granny Smith apples

* Dataset: ImageNet: dataset of 14+ million images

* Competition: ILSVRC: ImageNet Large Scale Visual Recognition
Challenge

* Networks:
  * AlexNet (2012)
  * ZFNet (2013)
  * VGGNet (2014)
  * GoogLeNet (2014)
  * ResNet (2015)
  * CUImage (2016)
  * SENet (2017)
ILSVRC Challenge Evaluation for Classification

* Top 5 error rate:
  * You get 5 guesses to get the correct label

* ~20% reduction in accuracy for Top 1 vs Top 5

* Human annotation is a binary task: “apple” or “not apple”

[Figure: image classification example with ground truth "Steel drum". Predictions (Steel drum, Folding chair, Loudspeaker): Accuracy 1. Predictions (Scale, T-shirt, Steel drum, Drumstick, Mud turtle): Accuracy 1. Predictions (Scale, T-shirt, Giant panda, Drumstick, Mud turtle): Accuracy 0.]

References: [123]
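The top-5 rule can be sketched as follows: a prediction counts as correct if the true label is among the five highest-scoring labels. The scores below are made up; only the label names come from the example above.

```python
def top_k_correct(scores, true_label, k=5):
    """True if true_label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]

def top_k_error(all_scores, true_labels, k=5):
    """Fraction of examples whose true label is missed by the top k guesses."""
    misses = sum(not top_k_correct(s, t, k)
                 for s, t in zip(all_scores, true_labels))
    return misses / len(true_labels)

# Made-up scores over the labels in the example.
scores = {"Scale": 0.30, "T-shirt": 0.25, "Steel drum": 0.20,
          "Drumstick": 0.15, "Mud turtle": 0.07, "Giant panda": 0.03}
```

Here "Steel drum" is ranked third, so the prediction is correct under top-5 but wrong under top-1, which is the gap between the two metrics.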
[Figure: ImageNet classification error (Top 5) by year, 2010 to 2017]

* AlexNet (2012): First CNN (15.4%)
  * 8 layers
  * 61 million parameters
* ZFNet (2013): 15.4% to 11.2%
  * 8 layers
  * More filters. Denser stride.
* VGGNet (2014): 11.2% to 7.3%
  * Beautifully uniform:
    3x3 conv, stride 1, pad 1, 2x2 max pool
  * 16 layers
  * 138 million parameters
* GoogLeNet (2014): 11.2% to 6.7%
  * Inception modules
  * 22 layers
  * 5 million parameters
    (throw away fully connected layers)
* ResNet (2015): 6.7% to 3.57%
  * More layers = better performance
  * 152 layers
* CUImage (2016): 3.57% to 2.99%
  * Ensemble of 6 models
* SENet (2017): 2.99% to 2.251%
  * Squeeze and excitation block: network
    is allowed to adaptively adjust the
    weighting of each feature map in the
    convolutional block.

* Human error: 5.1%
  * Surpassed in 2015
[Figure: ImageNet Classification Error (Top 5): 2012 (AlexNet), 2013 (ZF), 2014 (VGG), 2014 (GoogLeNet), 2015 (ResNet)]
[Figure: AlexNet architecture with max pooling layers and fully connected layers; labeled volumes include 3x3x384, 9216x1x1, 4096x1x1, 4096x1x1]

Krizhevsky et al. "Imagenet classification with deep convolutional neural
networks." Advances in neural information processing systems. 2012.
[Figure: network layer diagrams (AlexNet labeled). Legend: Image input; Conv: Convolutional layer; Pool: Max-pooling layer; FC: Fully-connected layer; Softmax: Softmax layer]

Simonyan et al. "Very deep convolutional networks
for large-scale image recognition." 2014.
[Figure: GoogLeNet architecture. Legend: Convolution, Pooling, Softmax, Concat layer. Inception module detail: previous layer, 1x1 convolutions, 3x3 convolutions, 5x5 convolutions, filter concatenation.]

Szegedy et al. "Going deeper with convolutions." Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Inception Module

[Figure: inception module: 1x1 convolutions, 3x3 convolutions, 3x3 max pooling, filter concatenation]

* Process: do different size convolutions, and concatenate

* Convolution sizes:
  * Smaller convolutions: local features
  * Larger convolutions: highly abstracted features

* Result: Fewer parameters and better performance
[Figure: VGG-19 beside a 34-layer residual network, ending in output; layer labels include "3x3 conv, 512, /2" and "3x3 conv, 512"]

[Figure: residual unit: Input, Convolution, Batch Norm, Convolution, Batch Norm, Addition, Output]

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2


Lex Fridman, lex.mit.edu, January 2018

MIT 6.S094: Deep Learning for Self-Driving Cars
https://selfdrivingcars.mit.edu


Residual Block

[Figure: residual block: the input x passes through two weight layers to give F(x), an identity shortcut carries x around them, and the block outputs F(x) + x]

* Initial Observation:
  * Network depth often increases
    representation power, but deeper networks are harder to
    train.

* Residual Block:
  * Repeat a simple network block (think: RNN)
  * Pass input along without transformation: help
    ensure that each layer learns something new
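"Pass input along without transformation" means the block outputs F(x) + x rather than F(x) alone. A minimal sketch on plain vectors, assuming F is two small weight layers with a ReLU between them and a final ReLU after the addition; batch normalization is left out and the weights are made up.

```python
def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Weight layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two weight layers."""
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    return relu([fi + xi for fi, xi in zip(f, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]
x = [1.0, 2.0]

passthrough = residual_block(x, zeros, no_bias, zeros, no_bias)
doubled = residual_block(x, identity, no_bias, identity, no_bias)
```

With all-zero weights the block simply passes `x` through, so adding a block cannot make things worse at initialization; the weight layers only have to learn the difference from the identity.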
SENet: Squeeze-and-Excitation Networks

[Figure: squeeze-and-excitation block]

* Content-aware channel weighting: Add parameters to each
channel of a convolutional block so that the network can
adaptively adjust the weighting of each feature map

* This approach is simple and can be added to any model

* Takeaway for thought: Parameterize everything (that's cost-effective)
including higher-order hyper-parameters.
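Content-aware channel weighting can be sketched in three steps: squeeze each channel to one number, compute a gate per channel from those numbers, then scale each channel by its gate. This is a simplified illustration in plain Python, assuming global average pooling for the squeeze and a two-layer network (ReLU, then sigmoid) without biases for the excitation; the weight matrices are hypothetical inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(v, weights):
    """One output per row of weights (no bias, for brevity)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def se_block(feature_maps, w_reduce, w_expand):
    """Reweight each channel by a gate computed from all channels.

    feature_maps: list of channels, each a 2D list of activations.
    """
    # Squeeze: global average pool each channel to one number.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: small two-layer network ending in a sigmoid gate per channel.
    hidden = [max(0.0, h) for h in linear(squeezed, w_reduce)]
    gates = [sigmoid(g) for g in linear(hidden, w_expand)]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]

# Two 1x2 channels; all-zero weights give a gate of 0.5 for every channel.
maps = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled = se_block(maps, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
```

Since the block only rescales existing feature maps, it can be attached after a convolutional block without changing any shapes.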
Capsule Networks (Hinton)

[Figure: two images that a CNN treats as the same]

* The problem: a CNN sees both images as the same.
* Internal data representation of a convolutional neural network does not
take into account important spatial hierarchies between simple and
complex objects.
* See upcoming online-only lecture on capsule networks.
Object Detection

[Figure: 1. Input image, 2. Extract region proposals (~2k), 3. Compute CNN features, 4. Classify regions]
Fully Convolutional Networks

* Goal: Classify every pixel in an image.
* Difficulty: Hard
* Why?
  * When precise boundaries of objects matter (medical, driving)
  * Useful for fusing with other sensors (LIDAR)
FCN (Nov 2014)

Paper: “Fully Convolutional Networks for Semantic Segmentation”

* Repurpose ImageNet pretrained nets
* Upsample using deconvolution
* Skip connections to reduce the coarseness of upsampling

[Figure: convolutionalization turns a classifier's single “tabby cat” prediction into a tabby cat heatmap]
SegNet (Nov 2015)

Paper: “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”

* Maxpooling indices transferred to decoder to improve the
segmentation resolution.

[Figure: Convolutional Encoder-Decoder from Input (RGB Image) to Output (Segmentation), with Pooling Indices passed from encoder to decoder. Legend: Conv + Batch Normalisation + ReLU, Pooling, Upsampling, Softmax]
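Transferring max-pooling indices can be sketched as follows: the encoder remembers where each maximum came from, and the decoder puts each value back at that position when upsampling. This is a simplified single-slice illustration in plain Python, not the SegNet implementation; the 4x4 values are made up.

```python
def max_pool_with_indices(x):
    """2x2 max pool, stride 2; also record where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(x) - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, len(x[0]) - 1, 2):
            window = [(x[r + i][c + j], (r + i, c + j))
                      for i in range(2) for j in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out

x = [[1, 1, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]
pooled, indices = max_pool_with_indices(x)
restored = unpool(pooled, indices, height=4, width=4)
```

The restored map is sparse but keeps each maximum at its original location, which is the spatial detail the decoder would otherwise have to guess.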
Dilated Convolutions (Nov 2015)

Paper: “Multi-Scale Context Aggregation by Dilated Convolutions”

* Since pooling decreases resolution:
  * Added “dilated convolution layer”

* Still interpolate up from 1/8 of original image size
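A dilated convolution spaces out the kernel taps, so the same number of weights covers a wider stretch of the input without any pooling. A minimal 1D sketch in plain Python (no padding, no kernel flip); the signal and kernel are made up.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1D convolution with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1   # input positions seen by one output
    return [sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
            for start in range(len(signal) - span + 1)]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 inputs
wide = dilated_conv1d(signal, kernel, dilation=2)    # each output sees a span of 5
```

Both calls use the same three weights; only the spacing between taps changes, which is how dilation enlarges context while keeping the parameter count fixed.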
DeepLab v1, v2 (Jun 2016)

Paper: “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and
Fully Connected CRFs"

* Added fully-connected Conditional Random Fields (CRFs) as a
post-processing step
  * Smooth segmentation based on the underlying image intensities

[Figure: Input, DCNN with Atrous Convolution, Coarse Score map, Bi-linear Interpolation, Fully Connected CRF, Final Output]
Key Aspects of Segmentation

* Fully convolutional networks (FCNs): replace fully-connected
layers with convolutional layers

* Deeper, updated models (now ResNet) consistent with ImageNet
Challenge object classification tasks.

* Conditional Random Fields (CRFs) to capture both local and
long-range dependencies within an image to refine the
prediction map.

* Dilated convolution (aka Atrous convolution): maintain
computational cost, increase resolution of intermediate feature
maps
ResNet-DUC (Nov 2017)

Paper: “Understanding Convolution for Semantic Segmentation”

* Dense upsampling convolution (DUC)
instead of bilinear upsampling
  * Learnable: Learn the upscaling filters

* Hybrid dilated convolution (HDC)
  * Use a different dilation rate

[Figure: Hybrid Dilated Conv. (HDC)]
FlowNet (May 2015)

Paper: “FlowNet: Learning Optical Flow with Convolutional Networks”

* Learn flow from image-pair, end to end.
  * FlowNetS: stacks two images as input
  * FlowNetC: convolve separately, combine with correlation layer

[Figure: FlowNetSimple architecture]
FlowNet 2.0 (Dec 2016)

Paper: “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks”

* Stack FlowNetS and FlowNetC
* Improvement over FlowNet
  * Smooth flow fields
  * Preserves fine-motion detail
  * Runs at 8-140fps

* Observations:
  * Stacking networks as an approach
  * Order of training dataset matters

[Figure: FlowNet compared with FlowNet 2.0]
SegFuse: Dynamic Driving Scene Segmentation

For the full updated list of references visit: https://selfdrivingcars.mit.edu/refe

[Figure: Optical Flow]
Thank You

Tomorrow: Waymo
Sacha Arnoud
Director of Engineering, Waymo

Next lecture: Deep Learning for Human Sensing

Upcoming online-only lectures:
- Capsule networks
- Generative adversarial networks